L10 - Back to ggplot2

Data Computing Chapter 8

Presenter: Olivia Beck
Content credit: Matthew Beckman

Agenda

Reminders

PackageName::FunctionName

library(tidyverse)
head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#?filter

#filter from dplyr
diamonds %>%
  dplyr::filter(color == "E") %>%
  head()
## # A tibble: 6 × 10
##   carat cut     color clarity depth table price     x     y     z
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good    E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.22 Fair    E     VS2      65.1    61   337  3.87  3.78  2.49
## 5  0.2  Premium E     SI2      60.2    62   345  3.79  3.75  2.27
## 6  0.32 Premium E     I1       60.9    58   345  4.38  4.42  2.68
#normal: same call without the package prefix
diamonds %>%
  filter(color == "E") %>%
  head()
## # A tibble: 6 × 10
##   carat cut     color clarity depth table price     x     y     z
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good    E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.22 Fair    E     VS2      65.1    61   337  3.87  3.78  2.49
## 5  0.2  Premium E     SI2      60.2    62   345  3.79  3.75  2.27
## 6  0.32 Premium E     I1       60.9    58   345  4.38  4.42  2.68
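
One reason the PackageName::FunctionName form is useful: loading the tidyverse masks stats::filter() with dplyr::filter(), and the two functions do very different things. The sketch below is only an illustration using the built-in mtcars data, which is not part of the original slides.

```r
library(dplyr)

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
six_cyl <- mtcars %>%
  dplyr::filter(cyl == 6)
nrow(six_cyl)

# stats::filter() is an unrelated function: it applies a linear filter
# (here a 3-point moving average) to a numeric series
smoothed <- stats::filter(1:10, rep(1 / 3, 3))
smoothed
```

Writing the prefix makes it explicit which of the two you mean, even when both packages are loaded.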

Building Graphics

  1. Draw by hand (or imagine) the specific plot that you intend to construct
  2. Data Wrangling (if needed) to get the data in glyph-ready form, or verify that the current form is glyph-ready for your purposes.
  3. Establish the frame using a ggplot() statement
  4. Create the intended glyph using geom_[style]() such as
    • geom_point()
    • geom_bar()
    • geom_boxplot()
    • geom_density()
    • geom_vline()
    • geom_segment()
    • geom_histogram()
    • and many more
  5. Map variables to the graphical attributes of the glyph using: aes( )
  6. Add additional layers to the frame using the + symbol
    • Note: use +, not %>%, between layers of ggplot2 graphics
    • Think of + as “add a layer on top of …” in the ggplot2 portion, whereas %>% means “and then the next step is…”

Steps 4 and 5 can be switched.
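
The steps above can be sketched end to end as follows. This is a minimal illustration, assuming the mpg data that ships with ggplot2; the names ClassMileage and mileage_plot are made up for the example.

```r
library(dplyr)
library(ggplot2)

# Step 2: wrangle into glyph-ready form (one row per class and year)
ClassMileage <-
  mpg %>%
  group_by(class, year) %>%
  summarise(avg_hwy = mean(hwy, na.rm = TRUE), .groups = "drop")

# Steps 3-6: frame, glyph, aesthetics, then extra layers joined with +
mileage_plot <-
  ClassMileage %>%                            # %>% hands the data to ggplot()
  ggplot(aes(x = year, y = avg_hwy)) +        # frame and shared aesthetics
  geom_line(aes(color = class)) +             # one line glyph per class
  geom_point(aes(color = class)) +            # a second layer on top
  labs(x = "Year", y = "Average highway mpg")

mileage_plot
```

Note where the operator changes: %>% is used while wrangling and passing the data in, and + takes over once ggplot() has been called.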

https://twitter.com/tanya_shapiro/status/1576935152575340544?t=vwaW8h6CC62h0pkwv9n5Yg&s=19

Example: Baby Names

Let’s look at our BabyNames data set again.

# data intake
data("BabyNames", package = "dcData")

# inspect data intake
glimpse(BabyNames)
## Rows: 1,792,091
## Columns: 4
## $ name  <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret", "Ida"…
## $ sex   <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
## $ count <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1288, 1258…
## $ year  <int> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880…

Wrangle into glyph-ready form

names <- c("Olivia", "Zoe", "Quentin")

Names <- 
  BabyNames %>%
  filter(name %in% names) %>%
  group_by(name, year) %>%
  summarise(total = sum(count, na.rm = TRUE))

Names %>%
  head()
## # A tibble: 6 × 3
## # Groups:   name [1]
##   name    year total
##   <chr>  <int> <int>
## 1 Olivia  1880    44
## 2 Olivia  1881    51
## 3 Olivia  1882    52
## 4 Olivia  1883    46
## 5 Olivia  1884    54
## 6 Olivia  1885    59
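
Why group and sum? BabyNames stores one row per name, sex, and year, so a name given to both boys and girls appears twice in the same year. The sketch below shows the effect on a tiny made-up table (ToyNames is hypothetical, not real data). It also passes .groups = "drop" to return an ungrouped result, which differs slightly from the slide code above.

```r
library(dplyr)

# Hypothetical miniature of BabyNames: "Quentin" appears once per sex in 1990
ToyNames <- tibble(
  name  = c("Quentin", "Quentin", "Zoe", "Zoe"),
  sex   = c("M", "F", "F", "F"),
  count = c(500L, 10L, 300L, 450L),
  year  = c(1990L, 1990L, 1990L, 1991L)
)

# One row per name and year, with counts added across sexes
ToyTotals <-
  ToyNames %>%
  group_by(name, year) %>%
  summarise(total = sum(count, na.rm = TRUE), .groups = "drop")

ToyTotals
```

After this step each (name, year) pair is a single case, which is what a line glyph with year on the x axis needs.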

In the beginning you might use mplot() to get started. Here’s the default result:

The graph looks perfectly fine, but this code isn’t easy to read.

This is why we stress writing readable code!

ggplot(data = Names, aes(x = year, y = total)) + geom_line()  + aes(colour = name) + theme(legend.position = "right") + labs(title = "")

we can do better

  1. establish the frame
  2. plot the glyphs (i.e., select a geom)
  3. map the aesthetics
  4. add labels and title
  5. other features (e.g., alpha, sizing, etc)

Our Plot

  1. Establish the Frame

Nothing is here! That is exactly what is supposed to happen. Calling ggplot() only tells R that we are ready to plot and creates an empty space (the frame) to draw in.

ggplot(data = Names) 

  2. Plot the glyphs (i.e., select a geom)

Still no plot! geom_line() needs to know which variables go on the x and y axes, so this code throws an error.

Note that ggplot uses +, NOT %>%. This is because we are adding layers to our plots.

ggplot(data = Names) + 
  geom_line()
## Error in `geom_line()`:
## ! Problem while setting up geom.
## ℹ Error occurred in the 1st layer.
## Caused by error in `compute_geom_1()`:
## ! `geom_line()` requires the following missing aesthetics: x and y

Note - this is why I like to map aesthetics first, so we can avoid errors.

  3. Map the aesthetics

Rule of thumb: whenever you plot with ggplot, ALL variables need to be inside an aes() (except facets, covered later in the slides), and all constants go outside of the aes().
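
To see why the rule matters, here is a small sketch of what happens when a constant ends up inside aes(). The Toy data frame is made up for the illustration; layer_data() is a ggplot2 helper that returns the data a layer will actually draw.

```r
library(ggplot2)

# Hypothetical toy data
Toy <- data.frame(year = 2000:2004, total = c(5, 7, 6, 9, 11))

# Constant outside aes(): the line is drawn in blue
p_outside <- ggplot(data = Toy) +
  geom_line(aes(x = year, y = total), color = "blue")

# Constant inside aes(): "blue" is treated as a one-level variable,
# so ggplot2 picks a color from its default palette and adds a legend
p_inside <- ggplot(data = Toy) +
  geom_line(aes(x = year, y = total, color = "blue"))

unique(layer_data(p_outside)$colour)
unique(layer_data(p_inside)$colour)
```

The second plot runs without an error, which is what makes this mistake easy to miss.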

# not quite: without groups, a single line connects all three names
ggplot(data = Names) +
  geom_line(aes(x = year, y = total))

# add groups
ggplot(data = Names) +
  geom_line(aes(x = year, y = total, group = name))

# add color
# note: mapping color to name also groups the lines by name,
# but the group aesthetic alone does not add color!
ggplot(data = Names) +
  geom_line(aes(x = year, y = total, color = name))
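
The relationship between group and color can be checked directly. The sketch below uses a made-up Toy data frame and layer_data() to inspect the group that ggplot2 assigns to each row.

```r
library(ggplot2)

# Hypothetical toy data: two names observed over three years
Toy <- data.frame(
  name  = rep(c("Olivia", "Zoe"), each = 3),
  year  = rep(2000:2002, times = 2),
  total = c(10, 12, 15, 4, 6, 5)
)

p_none  <- ggplot(Toy) + geom_line(aes(x = year, y = total))
p_group <- ggplot(Toy) + geom_line(aes(x = year, y = total, group = name))
p_color <- ggplot(Toy) + geom_line(aes(x = year, y = total, color = name))

# Number of groups (separate lines) in each plot
length(unique(layer_data(p_none)$group))
length(unique(layer_data(p_group)$group))
length(unique(layer_data(p_color)$group))

# Number of distinct colors in each plot
length(unique(layer_data(p_group)$colour))
length(unique(layer_data(p_color)$colour))
```

Mapping a discrete variable to color splits the data into groups as a side effect, while group on its own leaves every line in the default color.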

  4. Add labels and title
ggplot(data = Names) + 
  geom_line( aes(x = year, y = total, color = name)) +
  ggtitle("Names Over Time") +
  xlab("Year") +
  ylab("Popularity") +
  guides(color = guide_legend(title = "Siblings Names" ))

  5. Other features (e.g., alpha, sizing, etc.)
ggplot(data = Names) + 
  geom_line( aes(x = year, y = total, color = name, linetype = name)) +
  ggtitle("Names Over Time") +
  xlim(c(1972, 2022))+
  xlab("Year") +
  ylab("Popularity") +
  guides(color = guide_legend(title = "Siblings Names" ), 
         linetype = guide_legend(title = "Still Siblings Names" ))
## Warning: Removed 252 rows containing missing values (`geom_line()`).
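
The warning above appears because xlim() sets the limits of the x scale, and rows outside those limits are treated as missing. If the goal is only to zoom in, coord_cartesian() is an alternative that keeps every row. The sketch below compares the two on a made-up Toy data frame; it is an aside, not part of the original slides.

```r
library(ggplot2)

# Hypothetical toy data
Toy <- data.frame(year = 1:10, total = (1:10)^2)

# xlim() sets scale limits: x values outside 3..7 become NA
p_xlim <- ggplot(Toy) +
  geom_line(aes(x = year, y = total)) +
  xlim(c(3, 7))

# coord_cartesian() only zooms the view: all rows are kept
p_zoom <- ggplot(Toy) +
  geom_line(aes(x = year, y = total)) +
  coord_cartesian(xlim = c(3, 7))

# Count the x values that were turned into NA in each plot
sum(is.na(layer_data(p_xlim)$x))
sum(is.na(layer_data(p_zoom)$x))
```

For a plain line plot the visible difference is small, but for layers that compute statistics (densities, smoothers) the two approaches can give different results, because xlim() removes data before the statistic is computed.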

Remarks about faceting: facet_wrap()

Facets use a formula syntax (built around ~) that we haven’t seen much yet. There are two main ways to facet a plot: facet_wrap() and facet_grid(). Here are a few pointers:

Facet Wrap

data("NCHS", package = "dcData")

# `!is.na(smoker)` finds cases that are non-missing for `smoker` (i.e. removes NA's)
Heights <- 
  NCHS %>%
  filter(age > 20, !is.na(smoker)) %>%   
  group_by(sex, smoker, age) %>%
  summarise(height = mean(height, na.rm = TRUE))

head(Heights)
## # A tibble: 6 × 4
## # Groups:   sex, smoker [1]
##   sex    smoker   age height
##   <fct>  <fct>  <dbl>  <dbl>
## 1 female no        21   1.60
## 2 female no        22   1.62
## 3 female no        23   1.61
## 4 female no        24   1.62
## 5 female no        25   1.63
## 6 female no        26   1.62
Heights %>%
  ggplot(aes(x = age, y = height)) +   
  geom_line(aes(linetype = smoker)) +   
  facet_wrap( ~ sex)

Facet Grid

Heights %>%
  ggplot(aes(x = age, y = height)) + 
  geom_line(aes(linetype = smoker)) + 
  facet_grid(sex ~ .)

Heights %>%
  ggplot(aes(x = age, y = height)) + 
  geom_line() + 
  facet_grid(sex ~ smoker)
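
The formula forms used above can be summarized in one sketch. It assumes a small made-up Toy data frame rather than NCHS so that it runs without dcData; the panel counts come from the PANEL column that layer_data() returns.

```r
library(ggplot2)

# Hypothetical toy data: every combination of two grouping variables
Toy <- expand.grid(
  sex    = c("female", "male"),
  smoker = c("no", "yes"),
  age    = 21:25
)
Toy$height <- 1.6 + 0.1 * (Toy$sex == "male") + 0.001 * Toy$age

base <- ggplot(Toy, aes(x = age, y = height)) +
  geom_line(aes(linetype = smoker))

p_wrap <- base + facet_wrap(~ sex)          # one panel per sex, wrapped
p_rows <- base + facet_grid(sex ~ .)        # sex defines the rows
p_cols <- base + facet_grid(. ~ smoker)     # smoker defines the columns
p_grid <- base + facet_grid(sex ~ smoker)   # rows ~ columns

length(unique(layer_data(p_wrap)$PANEL))
length(unique(layer_data(p_grid)$PANEL))
```

In facet_grid() the left side of ~ gives the rows and the right side gives the columns, with . meaning "nothing on this side".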

Difference between color and fill

library(mosaicData)

head(CPS85)
##   wage educ race sex hispanic south married exper union age   sector
## 1  9.0   10    W   M       NH    NS Married    27   Not  43    const
## 2  5.5   12    W   M       NH    NS Married    20   Not  38    sales
## 3  3.8   12    W   F       NH    NS  Single     4   Not  22    sales
## 4 10.5   12    W   F       NH    NS Married    29   Not  47 clerical
## 5 15.0   12    W   M       NH    NS Married    40 Union  58    const
## 6  9.0   16    W   F       NH    NS Married    27   Not  49 clerical
CPS85 %>% 
  ggplot() +
  geom_density(aes(x = wage, color = sex), alpha = 0.4)+
  facet_grid( ~ married) +
  xlim(0,30) 
## Warning: Removed 1 rows containing non-finite values (`stat_density()`).

CPS85 %>% 
  ggplot() +
  geom_density(aes(x = wage, fill = sex), alpha = 0.4)+
  facet_grid( ~ married) +
  xlim(0,30) 
## Warning: Removed 1 rows containing non-finite values (`stat_density()`).

CPS85 %>% 
  ggplot() +
  geom_density(aes(x = wage, fill = sex, color = sex), alpha = 0.4)+
  facet_grid( ~ married) +
  xlim(0,30)
## Warning: Removed 1 rows containing non-finite values (`stat_density()`).

CPS85%>%
  ggplot(aes(x = married, color = sex)) + 
  geom_bar() +
  facet_wrap( ~ union, scales = "free")  #Note the scales here 

CPS85%>%
  ggplot(aes(x = married, fill = sex)) + 
  geom_bar()+
  facet_wrap( ~ union, scales = "free")  #Note the scales here 

CPS85%>%
  ggplot(aes(x = age, y = wage, color = sex)) + 
  geom_point()

CPS85%>%
  ggplot(aes(x = age, y = wage, fill = sex)) +  #fill has no effect on the default point shape!
  geom_point()
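
Fill is ignored for the default point shape, but point shapes 21 through 25 have both an outline (color) and an interior (fill). The sketch below shows this on a made-up Toy data frame, since the example is an aside and not from the original slides.

```r
library(ggplot2)

# Hypothetical toy data
Toy <- data.frame(
  age  = c(25, 32, 41, 50),
  wage = c(8, 12, 15, 11),
  sex  = c("F", "M", "F", "M")
)

# Default point shape: fill is mapped but not drawn
p_default <- ggplot(Toy, aes(x = age, y = wage, fill = sex)) +
  geom_point()

# Shape 21 is a circle with a separate outline (color) and interior (fill)
p_filled <- ggplot(Toy, aes(x = age, y = wage, fill = sex)) +
  geom_point(shape = 21, color = "black", size = 3)

unique(layer_data(p_filled)$fill)
```

Here shape, color, and size are constants, so they sit outside aes(), while fill is mapped to a variable and stays inside it.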

Another Example using Diamonds Data

  1. establish the frame

  2. plot the glyphs (i.e., select a geom)

  3. map the aesthetics

  4. add labels and title

  5. other features (e.g., alpha, sizing, etc)

  1. Establish the Frame

ggplot(data = diamonds)

  2. Plot the glyphs (i.e., select a geom)
ggplot(data = diamonds) +
  geom_point()
## Error in `geom_point()`:
## ! Problem while setting up geom.
## ℹ Error occurred in the 1st layer.
## Caused by error in `compute_geom_1()`:
## ! `geom_point()` requires the following missing aesthetics: x and y
  3. Map the aesthetics
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point()

  4. Add Titles and Labels
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = depth), alpha = 0.5, size = 1) +
  ggtitle("Diamonds Data") +
  xlab("Carat") +
  ylab("Price")

  5. Add additional features

Notice that I can have aes inside multiple statements. Notice that when I use constants (like alpha = 0.3, size = 0.1) they ARE NOT inside aes.

In general, variables go inside aes and constants go outside of it (unless we are using facets; see the previous materials).

ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(aes(colour = depth), alpha = 0.3, size = 0.1) +
  ggtitle("Diamonds Data") +
  xlab("Carat") +
  ylab("Price") +
  facet_grid( cut ~ color)

ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(colour = "red", alpha = 0.3, size = 0.1) +
  ggtitle("Diamonds Data") +
  xlab("Carat") +
  ylab("Price") +
  facet_grid( cut ~ color)

Side Note about placement of aes

aes can go inside the ggplot() function, inside the geom_[style]() function itself, or both. The following 3 options create the same plot, but the code is slightly different.

#option 1
ggplot(data = diamonds) +
  geom_point(aes(x = carat, y = price, color = clarity),
             alpha = 0.2, 
             size = 1) +
  geom_smooth(method = "glm" , 
              formula = y ~ poly(x, 2),                     # y = b_0 + b_1 x + b_2 x^2 + e
              aes(x = carat, y = price), 
              color = "red") +
  ylim(c(0, 20000))

#option 2
ggplot(data = diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.2, 
             size = 1) +
  geom_smooth(method = "glm" , 
              formula = y ~ poly(x, 2),                     # y = b_0 + b_1 x + b_2 x^2 + e
              aes(x = carat, y = price), 
              color = "red") +
  ylim(c(0, 20000))

#option 3
ggplot(data = diamonds,  aes(x = carat, y = price) )+ 
  geom_point( aes(color = clarity),
              alpha = 0.2, 
              size = 1) +
  geom_smooth(method = "glm" , 
              formula = y ~ poly(x, 2),                     # y = b_0 + b_1 x + b_2 x^2 + e
              color = "red") +
  ylim(c(0, 20000))
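
One way to convince yourself that the placement of aes does not change the result is to compare the data each version hands to the point layer. The sketch below does this for the three placements, using every 100th diamond to keep it quick; the object names are made up for the illustration.

```r
library(ggplot2)

# Every 100th diamond keeps the example quick
Small <- diamonds[seq(1, nrow(diamonds), by = 100), ]

# aes() only in the geom
p_layer <- ggplot(data = Small) +
  geom_point(aes(x = carat, y = price, color = clarity))

# aes() only in ggplot(), inherited by the geom
p_global <- ggplot(data = Small, aes(x = carat, y = price, color = clarity)) +
  geom_point()

# aes() split between ggplot() and the geom
p_split <- ggplot(data = Small, aes(x = carat, y = price)) +
  geom_point(aes(color = clarity))

d_layer  <- layer_data(p_layer)
d_global <- layer_data(p_global)
d_split  <- layer_data(p_split)

# Same positions and same colors in all three versions
all.equal(d_layer$x, d_global$x)
all.equal(d_layer$colour, d_global$colour)
all.equal(d_layer$colour, d_split$colour)
```

A practical guideline: put aesthetics shared by every layer in ggplot(), and put aesthetics that belong to one glyph only in that geom.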